Summary



Problem 

The purpose of this study was to explore the utility, from a 
psychometric and cost effectiveness standpoint, of a computerized 
adaptive measurement system in an Air Force technical training 
environment. Considering the uses a computer might be put to in a 
computer managed instructional system, adaptive testing offers 
potentially the greatest payoff, since theoretically testing time can 
be reduced substantially with either an increase in measurement 
accuracy or no decrease. This Phase I effort was designed to take
the study to the point of producing an operational system ready to 
actually test technical training students adaptively. Testing and 
analyzing the results so obtained will constitute the Phase II 
effort. 

Approach

A thorough review of the literature in the area of adaptive 
testing was conducted. This review indicated that two testing 
techniques showed considerable promise: flexilevel testing and 
hierarchical testing. These procedures were modified by adopting
a two-stage approach whereby a student would be branched into the
testing net according to a regression estimate of his predicted
score. This procedure is expected to minimize testing time by
administering items which are appropriate for the ability level of
the examinee. Two courses were selected to implement these
procedures: Block I of the Precision Measuring Equipment course was
selected for hierarchical testing, and Block IV of the Inventory
Management course was selected for flexilevel testing.

Results

For the two blocks of instruction, a task analysis was per- 
formed and appropriate measurement items selected. These items 
were then incorporated into a computer system for adaptive testing. 
The testing procedures were programmed in the TUTOR language
supported by the PLATO system at the University of Illinois. Three 
studies were then designed to evaluate the adaptive testing approach: 
(a) a study to test and validate flexilevel testing, (b) a study 
to test and evaluate hierarchical testing, and (c) a study to explore
testing of the examinee in the criterion zone. 
Conclusions



The conclusions which can be drawn from the study to date 
are necessarily preliminary in nature. However, it appears that 
adaptive testing offers the potential for time savings of up to 
50%. Furthermore, it was found that a very flexible computer 
system to drive the adaptive testing strategies could be relatively 
easily developed. However, the file handling and report generation
capabilities of the PLATO system, in this phase of development, were
found to require considerable ingenuity in programming.
I. ADAPTIVE TESTING: AN OVERVIEW



1.0 Definition of Adaptive Testing 

In a technical training situation involving large segments of 
systematic instruction, the frequency with which measurement occurs and
the number of measurement items administered constitute a large time
demand with respect to the efficiency of training. These time require- 
ments are magnified in individualized, criterion-referenced training 
situations or in any situation in which the pace at which instruction 
occurs is determined principally by a trainee's performance on tests 
(due to the acceleration and remediation effects). This is exactly
the situation faced by Air Force trainees for whom the demonstration 
of mastery on a lesson test is prerequisite to advancement to the next 
training objective. 

Adaptive testing is a theoretical framework with associated 
computerized techniques that combine to offer solutions to the growing 
measurement challenges of individualized technical training. Adaptive 
testing is characterized by three subprocesses: (a) appropriate test
selection and student entry, (b) tailored presentation of test items, 
and (c) sensitive scoring, diagnosis, interpretation, and reporting. 
For the first process, it is intuitively and empirically obvious that 
the test or composite test items should be selected to maximize the 
accuracy and meaningfulness of the outcome decision. In addition, a 
student should be entered into the test so as to minimize both trivial, 
easy items and highly difficult or impossibly hard items, while focusing 
on the presentation of appropriately difficult and discriminating items. 
Any adaptive test selection and entry process would have to be based on 
individual student characteristics to be valid. 
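
As an illustration of the entry process just described, the following is a minimal sketch of how a regression estimate might place a student in a pool of items ordered from easiest to hardest. The predictor variables, weights, and the linear mapping from predicted score to entry position are assumptions made for illustration only; they are not taken from the project's TUTOR programs.

```python
def predicted_score(predictors, weights, intercept):
    """Linear regression estimate of a student's proportion-correct score.

    The predictors and weights are hypothetical; in practice they would be
    fitted to prior student data.
    """
    return intercept + sum(w * x for w, x in zip(weights, predictors))


def entry_index(n_items, predicted_proportion):
    """Map a predicted proportion correct onto an entry position.

    Items are assumed to be indexed 0 (easiest) to n_items - 1 (hardest).
    Predictions outside 0..1 are clamped so the index is always valid.
    """
    clamped = min(1.0, max(0.0, predicted_proportion))
    return round(clamped * (n_items - 1))


# Hypothetical student: aptitude score of 80 and a prior block grade of 3.2.
estimate = predicted_score([80, 3.2], weights=[0.005, 0.1], intercept=0.05)
start = entry_index(11, estimate)
```

Under these assumed weights, a student predicted to do well enters near the hard end of the pool and bypasses the trivial items, which is the time saving the entry process is meant to provide.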

In turn, the test item presentation should be designed or 
"tailored" so as to match items to the current performance or ability 
level of the student. Simply, items that are too easy or too difficult 
for a student should be avoided. This is the essence of all tailored 
testing. Real time scoring and individualized movements based on correct/ 
error patterns are major requirements. 
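
The real-time, response-contingent movement described above can be sketched with a simple up-and-down branching rule in the spirit of the flexilevel technique named elsewhere in this report. This is an illustrative sketch only: the callback `answers_correctly` is a hypothetical stand-in for presenting an item at a terminal, and the rule shown (start at the middle item, move to the next unused harder item after a correct response and to the next unused easier item after an error, stop after half the pool plus one) is assumed here rather than drawn from the project's own implementation.

```python
def flexilevel(n_items, answers_correctly):
    """Administer a flexilevel-style test over items 0..n_items-1.

    Items are assumed to be ordered from easiest (0) to hardest (n_items-1).
    answers_correctly(i) is a hypothetical callback returning True when the
    student answers item i correctly.
    Returns the list of administered item indices and the number correct.
    """
    if n_items % 2 == 0:
        raise ValueError("n_items must be odd so there is a single middle item")
    middle = n_items // 2
    next_harder = middle + 1   # next unused item above the middle
    next_easier = middle - 1   # next unused item below the middle
    current = middle
    administered = []
    n_correct = 0
    for _ in range((n_items + 1) // 2):
        correct = answers_correctly(current)
        administered.append(current)
        if correct:
            n_correct += 1
            current = next_harder
            next_harder += 1
        else:
            current = next_easier
            next_easier -= 1
    return administered, n_correct


# Hypothetical student who can answer items 0 through 4 of a 7-item pool.
items, right = flexilevel(7, lambda i: i <= 4)
```

Only four of the seven items are presented in this example, and each presented item lies close to the student's performance level.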

Finally, the scoring procedure (right/wrong, average difficulty 
indices, average of correct item difficulty indices, etc.), the diagnos- 
tic interpretation, and the report (quantitative and/or verbal) should 
be sensitive to all the information on the student. For example, a 
bright student who is having a "bad day" should be differentially treated 
from the marginal student who is "all but eliminated." Each stage in this 
third process of adaptive testing should reflect both individual student 
data and the requirements of the training system so as to maximize students' 
learning rates and mastery performance as well as the efficiency of the 
training system.



In essence, adaptive testing is a more comprehensive measurement 
model that optimally selects and enters students into the assessment
process, tailors the test items, and individually scores, diagnoses,
interprets, and reports outcomes to the maximal benefit of the training
system. This report now turns to a consideration of how adaptive testing
can benefit Air Force technical training.

1.1 Role of Adaptive Testing in Air Force Technical Training

Adaptive testing is a subsystem of an adaptive training system as 
represented by the Air Force Advanced Instructional System. 1 The conceptual 
framework allows for a more integrated approach to training and measurement.
Five benefits accrue to Air Force technical training from the application of
adaptive testing.

First, fairly large chunks of time are devoted to evaluation in 
order to ascertain whether current training objectives have been mastered, 
and the trainee can thus advance to the next lesson. In some instances, 
testing may consume as much as one hour out of every five or six on task;
that is, as much as 16 to 20% of time on task may be devoted to
evaluation. This clearly is an inordinately high percentage of
time, and it prompts consideration of alternative strategies which would 
allow reduction of the amount of time spent in assessment, hence maximizing 
the percentage of time that can be devoted to instruction. Adaptive testing 
is, first and foremost, cost effective in that it offers a 50% or more 
reduction in measurement time. As will be revealed in the background
literature search, improved accuracy and potential acceleration may
increase this saving significantly by also reducing training time.
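
The arithmetic behind this claim can be made explicit. The sketch below uses only the figures quoted above (testing at one-sixth to one-fifth of time on task, and a 50% reduction in measurement time); the function name is illustrative.

```python
def training_time_saved(testing_share, testing_reduction):
    """Fraction of total time on task saved when testing time is reduced.

    testing_share: fraction of time on task spent on testing (e.g. 1/6 to 1/5).
    testing_reduction: fractional reduction in testing time (e.g. 0.5).
    """
    return testing_share * testing_reduction


low = training_time_saved(1 / 6, 0.5)   # testing is one hour in six
high = training_time_saved(1 / 5, 0.5)  # testing is one hour in five
```

On these figures, halving measurement time frees roughly 8 to 10% of total time on task for instruction, before any further saving from acceleration is counted.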

Secondly, the use of criterion levels for passing tends to mag- 
nify measurement errors in the critical decision region. For example, is a 
student with a score of 89% correct (given a criterion of 90%) really a 
failure, and does he, therefore, require a retraining cycle? Amplification 
of this critical decision region would minimize washback and eliminate 
attrition elements. Adaptive testing improves the precision in the cri- 
terion zone in two ways. First, the borderline students can be identified 
in real time via computer techniques. They can then (a) be given a more 
discriminating sequential test, or (b) have their wrong answers subjected 
to a more detailed analysis to determine the degree of partial knowledge. 
Second, the misleading element of guessing is minimized since the adaptive 
testing model adjusts item difficulty to the student's performance level; 
this eliminates the need to guess. Student motivation is maximized by 



1 D. N. Hansen, P. F. Merrill, R. D. Tennyson, D. B. Thomas, H. D. Kribs,
S. Taylor, and T. G. James, The Analysis and Development of an Adaptive
Instructional Model(s) for Individualized Technical Training, Technical
Report for Contract No. F33615-71-C-1277, Air Force Systems Command
(Tallahassee: Florida State University, 1973).
matching item difficulty with performance, since this avoids the
demoralizing effects of long series of unanswerable items or the tedium of
simplistic questions. Thus adaptive testing has the potential to improve the
accuracy (reliability) and precision (validity) of the outcome decision.
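
One way the borderline student might be flagged in real time is sketched below. The rule is an assumption made for illustration and is not taken from the project's programs: a score is treated as borderline when the criterion lies within a chosen multiple of the binomial standard error of the observed proportion correct.

```python
import math


def classify(n_correct, n_items, criterion=0.90, z=1.0):
    """Classify a score as 'pass', 'fail', or 'borderline'.

    The band half-width is z times the binomial standard error of the
    observed proportion correct. Both the criterion and z are assumed
    values that a training system would set for itself.
    """
    p = n_correct / n_items
    standard_error = math.sqrt(p * (1 - p) / n_items)
    if abs(p - criterion) <= z * standard_error:
        return "borderline"
    return "pass" if p > criterion else "fail"


# The 89% student from the example above, on a hypothetical 100-item test.
decision = classify(89, 100)
```

With these assumed settings the 89% student is flagged as borderline rather than failed outright, and could then be routed to a more discriminating sequential test or to an analysis of partial knowledge.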

Third, technical training is replete with numerous learning 
hierarchies that are structures of interrelated concepts, rules, skills, 
and subskills. The tension between theory and performance emphases in 
training reflects these hierarchies in technical training. Moreover, 
students enter career fields with partial mastery and gaps in their 
behavioral repertoires. Adaptive testing provides procedures for accu- 
rate entry and only appropriate movement within the training-measurement 
pattern for these hierarchies. The predictive power of the adaptive testing 
model allows for pretesting and acceleration around mastered subskills. The 
mixture of practice and testing can be more individualized, and save train- 
ing time through acceleration or minimized remediation. Thus an adaptive 
testing model offers an approach to optimal entry and movement within a 
required learning hierarchy. 

Fourth, as technical training becomes more individualized in order 
to gain improved training time savings, the logistics and information require- 
ments of measurement grow in geometric proportions. An adaptive testing 
model assists this managerial challenge by specifying only the essential,
required student data. Automation by computer improves scoring
accuracy, reduces instructor clerical work, and increases availability of 
information for critical decision making (e.g., elimination). An accrual 
structure can be built that more accurately predicts future successes and 
failures. Finally, the adaptive testing model can ultimately be utilized 
in the diagnostic process so as to minimize remediation time. 

Finally, adaptive testing models offer new paradigms for computer 
utilization within the training process. As general purpose digital 
computers are being employed for the management and simulation phases 
of technical training, the addition of the testing function represents 
a minor increment in computer system cost (i.e., 15% or less increase in 
cost). To optimally utilize this computing capacity, this research report 
reflects the goal of synthesizing "state-of-the-art" theoretical testing 
models into an operational model that fulfills the requirements of indi- 
vidualized Air Force Technical Training. 

1.2 Problem Structure 

The requirements of this research and development study can be 
viewed in terms of the subsequent sections of this report. First, an 
assessment of the "state-of-the-art" in adaptive testing was essential 
for identifying all feasible approaches and conceptually designing this 
Air Force adaptive testing model; Section 2 (Background Literature) will 
describe the results of this search and design process. In turn, the 
Air Force courses of Inventory Management and Precision Measuring Equip- 
ment were analyzed for potential application. Section 3 describes the 

results and delineates plans for the validation of the adaptive testing
model. This section also describes the computer implementation and
demonstration of feasibility. Finally, the conclusions and recommendations
will reflect the view of the Florida State University team for
future extensions of the adaptive testing model.
II. REVIEW OF LITERATURE



2.0 Background Literature

In assessing the literature which underlies the advancement in 
individually oriented testing, a complex nomenclature is found to bear 
on the field. The following list, with citations, gives some concept 
of this literature: 

Adaptive testing 2
Branched testing 3
Computer-assisted testing 4
Computerized testing 5
Flexilevel testing 6
Individualized testing 7
Multistage testing 8
Programmed testing 9
Response-contingent testing 10
Sequential item testing 11
Two-stage sequential testing 12



2 D. J. Weiss and N. E. Betz, Ability Measurement: Conventional or
Adaptive?, Research Report 73-1, prepared under Contract No.
N00014-67-A-0113-0029, NR No. 150-343, Office of Naval Research
(University of Minnesota, 1973).

3 A. G. Bayroff, Feasibility of a Programmed Testing Machine, Research
Study 64-3 (U.S. Army Personnel Office, November, 1964).

4 J. E. Crick, "A Critical Review of Computer-Assisted Testing"
(Unpublished Qualifying Paper, University of Massachusetts, 1972).

5 D. N. Hansen and G. Schwarz, An Investigation of Computer-Based Science
Testing, Institute of Human Learning Technical Report (Tallahassee: Florida
State University, 1968).

6 F. M. Lord, "The Self-Scoring Flexilevel Test," Journal of Educational
Measurement 8 (1971): 147-151.

7 See footnote 2 above.

8 See footnote 2 above.

9 T. A. Cleary, R. L. Linn, and D. A. Rock, "An Exploratory Study of
Programmed Tests," Educational and Psychological Measurement 28 (1968):
345-360.



The reason for the profusion of terminology is that there is 
almost an infinite number of ways of tailoring single test items or adapt- 
ing blocks of tests to a given individual. While Lord suggested an 
emphasis on the key feature, namely tailoring the items to the individual, 
it is the contention of this review that an adaptive approach appears 
more appropriate. 13 The adaptive testing model and the associated litera- 
ture review will therefore be primarily organized in three sections, 
namely: test selection and entry processes; tailored testing; and adaptive 
scoring, diagnosis, interpretation, and reporting. 

Prior to the presentation of these main sections, a concise
summary of prior reviews may set the historical framework out of which
the project's adaptive testing model grew.



2.1 Literature Reviews 



During the past decade there have been numerous reviews of the 
individualized testing field. Rosenbach's review emphasized the utility 
approach to sequential testing. 14 The review by Paterson elaborated on
the sequential probability decision rules of Wald and their application
to ability assessment. 15 16 Ferguson surveyed in depth the existing



10 R. Wood, "Fully Adaptive Sequential Testing: A Bayesian Procedure
for Efficient Ability Measurement" (Unpublished manuscript, University of
Chicago, 1972).

11 D. R. Krathwohl and R. J. Huyser, "The Sequential Item Test (SIT),"
American Psychologist 11 (1956): 419.

12 L. J. Cronbach and G. C. Gleser, Psychological Tests and Personnel
Decisions (Urbana: University of Illinois Press, 1965).

13 F. M. Lord, "Some Test Theory for Tailored Testing," in W. H. Holtzman,
ed., Computer-Assisted Instruction, Testing, and Guidance (New York: Harper
and Row, 1970).

14 J. H. Rosenbach, "An Analysis of the Application of Utility Theory
to the Development of Two-Stage Testing Models" (Unpublished Ph.D.
dissertation, University of Buffalo, 1961).

15 J. J. Paterson, "An Evaluation of the Sequential Method of
Psychological Testing" (Unpublished Ph.D. dissertation, Michigan State
University, 1962).

16 A. Wald, Sequential Analysis (New York: Wiley, 1947).
theories and methods of branching testing. 17 Lord reviewed some of the
major findings of tailored testing derived from theoretical studies. 18
Bock and Wood, in their survey of test theory for the period from 1966 
through 1969, included a section on sequential item testing. 19 
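
The sequential probability decision rules of Wald mentioned above can be illustrated with a short sketch of a sequential probability ratio test applied to a mastery decision. The two hypothesized proportions correct (0.9 for a master, 0.7 for a nonmaster) and the error rates are assumed values chosen for illustration; they are not figures from Paterson's review or from this project.

```python
import math


def sprt_mastery(responses, p_master=0.9, p_nonmaster=0.7,
                 alpha=0.05, beta=0.05):
    """Wald sequential probability ratio test for a mastery decision.

    responses: iterable of booleans (True = item answered correctly).
    alpha: accepted risk of calling a nonmaster a master.
    beta: accepted risk of calling a master a nonmaster.
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: decide "master"
    lower = math.log(beta / (1 - alpha))   # cross this: decide "nonmaster"
    log_ratio = 0.0
    used = 0
    for correct in responses:
        used += 1
        if correct:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1 - p_master) / (1 - p_nonmaster))
        if log_ratio >= upper:
            return "master", used
        if log_ratio <= lower:
            return "nonmaster", used
    return "undecided", used


decision, items_used = sprt_mastery([True] * 20)
```

Testing stops as soon as the accumulated evidence crosses either boundary, so a consistently correct or consistently incorrect student is classified after far fewer items than a fixed-length test would require.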

In general, most of these reviews noted the general lack of 
empirical findings for computer-based testing and made the observation 
that conventional ability tests tend to provide more accurate
measurements than tailoring strategies at the middle or median range of the
ability distribution. (It should be noted that all of these latter find- 
ings are based on theoretical or simulation studies, and are not con- 
sistent with the limited empirical observations.) 

Weiss and Betz have provided the most extensive review of adaptive 
testing to date. 20 They have divided their review into three types of 
studies: theoretical, simulation, and empirical. Their views are
summarized concisely in the following paragraphs, beginning with
their final conclusions, which indicate the focus of their review
of adaptive testing. They consider that adaptive tests
are: (1) considerably shorter than conventional tests, with little or no 
loss in validity or reliability; (2) more reliable than conventional tests 
in several studies and yielding more nearly constant precision than 
standard tests throughout the range of abilities; and (3) in several cases 
more valid, as measured against an external criterion, than are conven- 
tional tests. 21 



2.1.1 Theoretical Studies 



Weiss and Betz characterize the theoretical studies to date as 
providing a great deal of comparative information on a variety of test 
strategies, but yielding limited insight into any inferences to be made
for real-world contexts. The rationale for this assertion is based on the
fact that all of the theoretical studies are concerned only with
hypothetical individuals and hypothetical test items. Moreover, these theoretical
studies have validities based on a set of highly restricted assumptions 
(e.g., the probability of a correct response to an item is normally dis- 
tributed; the discrimination power of all items is constant; items vary 



17 R. L. Ferguson, "The Development, Implementation, and Evaluation of
a Computer-Assisted Branched Test for a Program of Individually Prescribed
Instruction" (Unpublished Ph.D. dissertation, University of Pittsburgh, 1969).

18 See footnote 13 on page 10.

19 R. D. Bock and R. Wood, "Test Theory," Annual Review of Psychology
22 (1971): 193-223.

20 See footnote 2 on page 9.

21 See footnote 2 on page 9 (pages 58-59).
only in difficulty, and all scales are unidimensional in nature). Finally,
there are no tests of significance for the information index offered by 
the theoretical studies and therefore no empirical methods for determining 
the relative differences among them. For present purposes, the theoretical 
studies can be viewed as offering a potential road map for selecting the 
most appropriate models applicable to the technical training area. 

2.1.2 Simulation Studies 

A number of studies were reviewed that simulated adaptive testing with
real or generated data. Table 1 summarizes the more pertinent simulation
studies. As with the theoretical studies, Weiss and Betz comment: "They can
be used simply as a preliminary device for the technical comparisons of
certain adaptive strategies, but results should not be considered definitive
until they are replicated in empirical live testing studies." 22

2.1.3 Empirical Studies 

The limited number of empirical studies reviewed by Weiss and
Betz indicated a number of serious problems, namely, a confusion of testing
methods (paper and pencil or computer), small samples, careless
experimental procedures, etc. (see Table 2). These problems give rise to
serious questions regarding the validity of the studies. In spite of the
limitations cited, Weiss and Betz make a strong argument for empirical
research:

It is only through empirical studies that the actual 
effects of adaptive test administration on the testee and 
his performance will ultimately become known. Future 
empirical studies of adaptive testing should be based on 
reasonably large numbers of subjects from carefully defined 
populations, using tests based on well-structured item pools 
normed on large and appropriate groups of subjects, with 
tests pretested to obtain appropriate kinds of score distri- 
butions and probably computer-administered to reduce the 
extraneous sources of variance in test scores. 23

These methodological remarks will strongly influence the proposed research 
activities to be described in Section III. 

The Hansen, Hedl, and O'Neil review scanned the literature from
a different viewpoint, namely, (a) computer-based test administration, (b)
scoring, and (c) reports. 24 The review of computer-based test
administration indicated a lagging of measurement studies behind the advances in
technological capability, particularly in the area of software. 



22 See footnote 2 on page 9 (page 44).

23 See footnote 2 on page 9 (page 43).

24 D. N. Hansen, J. J. Hedl, and H. F. O'Neil, Review of Automated
Testing, Technical Memo No. 20 (Florida State University, 1971).
Computer-based test administration may be described in terms of
four areas of methodological activity: (a) terminal equipment, (b) the 
interactive testing process, (c) reliability and validity issues, and 
(d) the collection of multiple response indices. 

It is now possible to find typewriters, cathode ray tubes, and 
slide projectors being used for test item presentation. Since the cre- 
ation of inexpensive terminal equipment is one of the dynamic areas in 
computer technology, one can anticipate more sophisticated terminal
devices as well as a significant decrease in cost. On the other hand,
progress with respect to the operation of appropriate audio presentation
units and natural speech analyzers has been discouraging. Although
digitized speech and speech analysis devices are being investigated
at Stanford and Haskins Laboratories respectively, the generic problems
involved in natural speech analysis are delaying the development of new
equipment. In regard to psychomotor/manipulative presentations, cost
seems to be one of the greatest deterrents to any extensive development.
Therefore, the studies noted will focus on the cognitive/symbolic aspects of
adaptive testing.

Turning to the characteristics of the student-terminal interaction,
several investigators have provided indirect evidence that this man-machine
dialogue may be characterized as unbiased, nonstressful, and personalized
in nature. For example, Smith points to a "confession machine effect"
which appears to enhance data acquisition in particular content areas
such as the subject's personal experience or his perceived personality
characteristics. 25 Evans and Miller found that students responded with
greater honesty and candor to highly personal items of a social science
questionnaire, and Cogswell and Estavan have reported similar findings on
the apparent confidentiality of the computer interview. 26 27 Therefore,
the use of adaptive testing techniques on the student course
critique appears promising.

Evidence for the nonthreatening nature of a computer-based
evaluation comes from a study by Gallagher. 28 He investigated the relationship



25 R. E. Smith, "Examination by Computer," Behavioral Science 8
(1963): 76-79.

26 W. M. Evans and J. R. Miller, "Differential Effects on Response Bias
of Computer vs. Conventional Administration of a Social Science
Questionnaire: An Exploratory Methodological Experiment," Behavioral
Science 14(3) (1969): 216-227.

27 J. F. Cogswell and D. P. Estavan, Explorations in Computer-Assisted
Counseling, TM-2582 (System Development Corporation, 1965).

28 P. D. Gallagher, An Investigation of Instructional Treatments and
Learner Characteristics in a Computer-Managed Instruction Course, Technical
Report No. 12 (Tallahassee: Florida State University, CAI Center, 1970).
of instructional treatments and learner characteristics in a
terminal-oriented computer-managed instruction course. Computer evaluation
and instructor evaluation of term projects produced performance scores which
were negatively related to trait anxiety (r = -.51) in the
instructor-evaluated group, but were not related in the computer-evaluated
group (r = -.03). One might assume that the treatment which emphasized
human interaction posed a greater threat to the individual's
self-esteem.

Cronbach cites a number of advantages of computerized tailored 
testing, namely, excellence of standardization, control of bias, precision 
of timing, and the integration of learning and testing. 29 

Reliability and validity studies concerning automated adminis- 
tration procedures have demonstrated, from an empirical standpoint, the 
feasibility of a technological approach, and have paved the way for 
further research and development efforts. For example, Elwood developed 
a noncomputerized automated testing booth to administer the Wechsler 
Adult Intelligence Scale (WAIS). 30 Orr reported favorable results for
this approach from a comparison of an automated WAIS presentation with
a traditional WAIS presentation (r = .93). However, this system only
provides scoring capabilities for 2 of the 11 subtests (Digit Span and
Digit Symbol). 31 Recent computer methodology describes how the
administration of intelligence test items can be programmed to allow for
repetition and expansion of verbal responses. 32 This more contingent,
interactive elicitation of responses yields reliability and validity
indices equivalent (or slightly superior) to those found for human
presentation. This demonstrated the objective facets of computer-based testing.

In a study of computer-based branched testing, Hansen found a 
significant improvement in internal consistency reliability for computer 
presentation (r = .80) in comparison with a conventional classroom 
29 L. J. Cronbach, Essentials of Psychological Testing (3rd ed.;
New York: Harper and Row, 1970).

30 D. L. Elwood, "Automation of Psychological Testing," American
Psychologist 24(3) (1969): 287-289.

31 T. B. Orr, "A Comparison of the Automated Method and the Face-to-Face
Method of Administering the Wechsler Adult Intelligence Scale,"
paper presented at the meeting of the Indiana Psychological Association,
Indianapolis, April, 1969.

32 J. J. Hedl, Jr., An Evaluation of a Computer-Based Intelligence
Test, Technical Report 21 (Tallahassee: Florida State University,
CAI Center, 1971).
achievement test (r = .43). More interestingly, the computer-based test
yielded a significant relationship (r = .76) with a college entrance
aptitude score. In addition, Hansen found that the addition of subjective
confidence responses yielded improved validity coefficients. Massengill
and Shuford have reported similar results. 34

Obviously, the full potential of multiple dependent measures
remains to be empirically explored within automated testing. Multiple
dependent measures such as latency, subjective confidence, and anxiety can
be incorporated to improve both the diagnostic power and efficiency of the
psychometric instruments. Research with the Minnesota Multiphasic
Personality Inventory (MMPI) has shown that the information processing time
(latency) for a given item is partially a function of the number of
characters in the item, the ambiguity of the item, and the social
desirability value of the item. 35 Massengill and Shuford have shown that
subjective confidence ratings significantly increase test reliability.
Hansen reported an improved predictive relationship for a college entrance
aptitude measure if confidence scores are included with the right/wrong
CAI scores. 37

Although the employment of computers to calculate test scores
and to carry out statistical analyses and summaries of test data has been
common for many years, the volume has been growing at a considerable rate.
Woods presents a comprehensive survey of the general uses of such data
processing techniques in school testing programs. 38 However, the
application of these response analysis techniques to online terminal-oriented
computer testing systems is a recent advance. We turn now to the
consideration of the use of natural language processing for test responses.
33 D. N. Hansen, "An Investigation of Computer-Based Science Testing,"
in R. C. Atkinson and H. A. Wilson, eds., Computer-Assisted Instruction:
A Book of Readings (New York: Academic Press, 1969).

34 H. E. Massengill and E. A. Shuford, Report on the Effect of
"Degree of Confidence" in Student Testing (Lexington, Mass.: The
Shuford-Massengill Corporation, 1967).

35 T. G. Dunn, R. E. Lushene, and H. F. O'Neil, "A Complete Automation
of the Minnesota Multiphasic Personality Inventory and a Study of its
Response Latencies," paper presented at the annual meeting of the American
Educational Research Association, New York City, 1971.
Research on the input and output of natural language during online
communication between the student and the system has been reported by
Starkweather; Colby, Watt, and Gilbert; and Weizenbaum. 39 40 41 These
authors have developed computer techniques to conduct psychotherapeutic
dialogues with patients. Hedl, O'Neil, and Hansen have shown that an
interactive dialogue is possible with the automated administration of an
individualized intelligence test. 42

Peck and Veldman of the University of Texas have been developing
a computer-based system for presenting and scoring responses to a sentence
completion test. 43 44 The problems of syntax were reduced by restricting
the subject to a single word in responding to each sentence
stem. The most recent system produces 40 scores from a 36-item form and 
employs a complex word-root data reduction system. This prototypic 
tailored inquiry method offers many of the benefits of a traditional inter- 
view, and might serve as a basis of future programs which could conduct 
intensive assessment interviews. 



39 J. A. Starkweather, "COMPUTEST, a Computer Language for Individualized
Testing, Instruction, and Interviewing," Psychological Reports 17
(1965): 227-237.

40 M. C. Colby, J. B. Watt, and J. P. Gilbert, "A Computer Method of
Psychotherapy: Preliminary Communication," Journal of Nervous and Mental
Disease 142(2) (1966): 148-152.

41 J. Weizenbaum, "ELIZA-A Computer Program for the Study of Natural
Language Communication Between Man and Machine," Communications of the
Association for Computing Machinery 9 (1966): 36-45.

42 J. J. Hedl, H. F. O'Neil, and D. N. Hansen, "Computer-Based
Intelligence Testing," paper presented at the annual meeting of the American
Educational Research Association, New York City, February, 1971.

43 R. F. Peck and D. J. Veldman, An Approach to Psychological Assessment
by Computer, Research Memorandum No. 10 (Austin: University of Texas,
1961).

44 D. J. Veldman, "Computer-Based Sentence Completion Interviews,"
Journal of Counseling Psychology 14(2) (1967): 153-157.
Recently, Archambault developed a computerized program to score
verbal responses to three of the seven subtests of the Torrance Tests of
Creative Thinking. 45 Subject responses to each of the subtests are scored
for fluency, flexibility, and originality. Archambault's data indicated
that creativity, as defined by Torrance, was judged accurately by a
computer. The syntax problems were reduced by analyzing only the frequency
of word usage. Even so, this word-frequency, or word-phrase lookup,
procedure produced significant correlations ranging from .52 to .99
between the computer and the pooled scores of four trained judges. It appears
that the use of a computer to score open-ended responses to standardized
test items is feasible and should be further investigated.

In reviewing the recent research on the automated interpretation 
of test results, Hansen, Hedl, and O'Neil pointed out that the challenge 
facing such automation is the conversion of quantitative indices or pro- 
files into meaningful verbal statements. The main thrust of research in 
this area has been in the personality rather than the aptitude domain. 
Thus a number of studies have concentrated on computerized interpretation
of MMPI and Rorschach profiles. 47 48 49 50 51 

Relevant to the present concern, however, are the few studies
that have dealt with computerized interpretation of aptitude or achievement
tests. In one study, Helm programmed the evaluation of a battery of
individual scores for each student.52 The output was designed mainly for
direct translation of scores, although there was limited capability for
comparison and contrast of profile scores. A more innovative development
was a program developed by Cogswell and Estavan.53 This program was
designed to evaluate student folders containing information such as
grades, aptitude test scores, etc. Agreement between computer statements
and the evaluative statements of two counselors was 75%.

The diagnostic nature of the statements from the Cogswell and 
Estavan program is an important advance in research on automated score 
interpretation. An automated diagnostic system with interpretive capa- 
bilities could be designed to relate instructor strategies to particular 
student profiles. The system could be designed to examine both academic
and personality variables, suggesting strategies on a real-time basis.

Another important aspect of an automated diagnostic and inter- 
pretive system is the capability for differential interpretive reporting 
according to the intended audience. Such a system is able to provide,
at one time, diagnostic information statements meaningful to a course 
instructor, and at another time, more sophisticated information for pro- 
fessionals engaged in research activities. 

2.2 Test Selection and Student Entry 

As can be inferred from the prior reviews, research on computer
selected and/or composed tests is practically nonexistent. Wood reviewed
the techniques for computer-composed tests. The Naval CMI project at
Memphis illustrates how students can be routed to specific tests.
Adaptive selection of tests remains a highly promising topic for future
research. Rasch provides a model that yields equivalent individual
measurement (scores) from sets of items varying in difficulty. Masang



C. E. Helm, "Simulation Models for Psychometric Theories," In 
proposed a procedure for item weighting to achieve invariance of test
scores under varying test difficulty levels. 57 Obviously, a general
purpose computer with large storage capacity allows tests to be composed
in real time, offering a nearly unlimited range of solutions to the problem.

In turn, adaptive entry of a student into a test arranged in a
difficulty hierarchy remains unexplored. Owen has developed a procedure 
for applying Bayesian concepts to either the appropriate determination of 
a test or for the tailoring of test items to each student, the methodology 
being appropriate for each problem. The Bayesian models offer a number 
of distinct advantages: 

1. The step size of difficulty between tests can be of the examiner's 
choice. 

2. The choice of entry is dependent upon previously collected data 
on each student. 

3. The choice of scoring method is less important and is primarily 
governed by the choice of a loss function selected by the examiner. 

4. All of the test item parameters are permitted to vary. 

The unfortunate restrictive assumptions concern the unidimensionality
of the ability (performance) and the independence of responses, both
found in all other tailor-like models. In determining the test, it
should be chosen such that the test item parameters lead to a minimized
a posteriori variance of the ability. As the test proceeds, the a posteriori
distribution of the ability parameter can be calculated, and new estimates
of the student's mean ability and variance can be computed, with appropriate
adjustments made within the test. It should be pointed out that there
has been little theoretical or theoretical-comparative work (e.g.,
theoretical simulated comparisons), and no empirical work, using this
approach. In essence, it appears to be on the very forefront of the
state of the art. As very large computing systems become available,
Bayesian models should be investigated primarily for their potential
in determining test selection, and in addition as perhaps the most
appropriate way of tailoring item presentations to a student.
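
The posterior updating described in this paragraph can be illustrated with
a minimal Python sketch. It is not Owen's procedure as published: it assumes
a single ability dimension represented on a discrete grid, a standard normal
prior, and a logistic item response function used here as a stand-in. All
function names and parameter values are hypothetical.

```python
import math

def logistic(theta, b, a=1.0):
    """Assumed item response function: P(correct | ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def update_posterior(grid, prior, b, correct, a=1.0):
    """Return the normalized posterior over the ability grid after one response."""
    likelihood = [
        logistic(t, b, a) if correct else 1.0 - logistic(t, b, a) for t in grid
    ]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def mean_and_variance(grid, dist):
    """Mean and variance of a discrete distribution over the grid."""
    mean = sum(t * p for t, p in zip(grid, dist))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, dist))
    return mean, var

# Hypothetical setup: ability grid from -4 to +4 with a standard normal prior.
grid = [i / 10.0 for i in range(-40, 41)]
raw = [math.exp(-0.5 * t * t) for t in grid]
raw_total = sum(raw)
prior = [r / raw_total for r in raw]

# One correct response to an item of median difficulty (b = 0).
posterior = update_posterior(grid, prior, b=0.0, correct=True)
prior_mean, prior_var = mean_and_variance(grid, prior)
post_mean, post_var = mean_and_variance(grid, posterior)
```

After a correct response the estimated mean ability moves upward and the
variance shrinks; repeating the update item by item gives the running
estimates of mean ability and variance that the text refers to.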

In turn, adaptive entry of a student into a test arranged in a 
difficulty hierarchy remains unexplored. In a more integrated instructional
and testing paradigm, Suppes has provided for individualized entry
for well over 50,000 students in a mathematics CAI drill and practice



B. Masang, "Item Weighting: An Approach to Invariance of Test
Scores Under Varying Test Difficulty Levels" (Unpublished Preliminary
Paper, Florida State University, 1972).
program. The results indicate that students can be given appropriate
entry based on the single variable of grade level and find an appropriate
performance level within a minimum of one hour of instruction. Similar
results have been reported from work with a similar public
school population. 60 It should be observed that each of these programs
utilized only one variable (grade level) for the predicted entry placement.
If multivariate regression techniques were utilized, a much more precise
placement could almost certainly be determined.
It should be observed, though, that the evaluation of placement for adaptive
testing will have to be made in terms of the criterion of minimum
number of test item presentations, since the behavioral evaluation is
elusive at best, and perhaps impossible to answer in terms of student
self-ratings.

2.3 Tailored Testing 

In this section, eight different formal models will be presented
which provide for a form of matching the item presentation to the ability
(performance) indices of the student. For each of the models, a brief
characterization and an elaboration of its advantages and limitations
will be presented. The formal characteristics of each model can
be found by searching the literature and studying its axiomatic and
psychometric characteristics.



2.3.1 Sequential Item Testing Model 

Arising from statistical decision theory and the sequential probability
ratio test developed by Wald, the sequential item testing model
establishes three decision outcome spaces: (a) success, (b) failure,
and (c) a further test area.61 As test response data are collected from
a student, the sequential analysis is performed and the appropriate statistical
inference established, such that testing continues if the summed statistics
remain in the indeterminate stage and stops if the student is classified
in either the success or failure outcome space. The earliest descriptions
of this model can be found in the early 1950's, and its more recent
application has been reviewed and reported by Ferguson.62 63

In general, this procedure requires approximately one half the
number of test items required by the conventional procedure. Its primary
advantage is its efficiency in reaching a decision. Perhaps its most desirable
feature concerns borderline students, who are given every possible
opportunity to pass until all test items are exhausted. The administration of
additional items to the borderline student increases measurement accuracy
and should be considered a desirable feature to be investigated in an
empirical fashion. Unfortunately, however, the model requires four
parameters: the pass and failure boundary points, such as pass at .90
and fail at .85, and the risk factors referred to as alpha and beta error
types. Since there is no given rationale for specifying these values,
only broad empirical study will provide a basis from which an appropriate
selection of parameter values can be derived. Thus, this will
undoubtedly be one of the models implemented only in some ultimate
and concluding phase of adaptive testing research.
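
The role of the four parameters can be made concrete with a short Python
sketch of Wald's sequential probability ratio test applied to right-wrong
responses. The boundary values .90 and .85 come from the example in the text;
the alpha and beta values of .05, the function name, and the return format
are assumptions made for illustration only.

```python
import math

def sprt_classify(responses, p_fail=0.85, p_pass=0.90, alpha=0.05, beta=0.05):
    """Classify a student as 'pass', 'fail', or 'undecided'.

    responses: iterable of booleans (True = correct), in order of administration.
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this boundary means pass
    lower = math.log(beta / (1 - alpha))   # crossing this boundary means fail
    log_ratio = 0.0
    used = 0
    for correct in responses:
        used += 1
        if correct:
            log_ratio += math.log(p_pass / p_fail)
        else:
            log_ratio += math.log((1 - p_pass) / (1 - p_fail))
        if log_ratio >= upper:
            return "pass", used
        if log_ratio <= lower:
            return "fail", used
    # Items exhausted while still in the "further test" region.
    return "undecided", used
```

With pass and fail boundaries this close together, a long run of items is
needed before a consistently correct student is passed, whereas a run of
errors leads to a fail decision quickly. This shows how strongly the choice
of the four values shapes test length.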

2.3.2 Robbins Monro Procedure 

This model, like all of the tailored testing models to be reviewed,
starts each student at a median difficulty level; it then successively reduces
the step size between item difficulties as the test proceeds. For example,
the step size might start out at .20 and successively come down to .01.
Testing is continued until the student reaches some difficulty level at
which he answers half of the items correctly and half of the items incorrectly.
The procedure was created by Robbins and Monro and reviewed by Wetherill
and Lord. Stocking also provides an evaluation of the process
as a testing technique.68



The Robbins Monro process provides excellent measurement at 
high and low performance levels, but unfortunately is less efficient 
than conventional testing in the median range. In addition, the large
number of test items required makes it a burdensome model to implement.
The difficulty of implementing the model can be directly related
to the observation that no empirical study of this process has been
attempted to date.
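
A shrinking step size rule of this kind can be sketched as follows. This is
an illustration rather than the Robbins Monro procedure as published: the
difficulty scale (0 to 1, larger meaning harder), the rule that the k-th step
equals the first step divided by k, and all names are assumptions. The .20
starting step and the .01 floor are taken from the example in the text.

```python
def shrinking_step_path(responses, start=0.50, first_step=0.20, min_step=0.01):
    """Return the sequence of difficulty levels visited.

    responses: iterable of booleans (True = correct).
    The step after the k-th response is first_step / k, never below min_step.
    """
    level = start
    path = [level]
    for k, correct in enumerate(responses, start=1):
        step = max(first_step / k, min_step)
        level += step if correct else -step
        level = min(1.0, max(0.0, level))  # stay on the assumed 0-1 scale
        path.append(level)
    return path

# Example: correct, wrong, correct.
example_path = shrinking_step_path([True, False, True])
```

Because every step is smaller than the one before, the visited levels settle
toward the point at which the student answers about half the items correctly,
which also means an item must exist near every level the rule can reach.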

2.3.3 Branching Models 

This model routes a student through a large network of test
items according to a simple rule: if a student answers an item
correctly, present a slightly more difficult next item; if he
responds incorrectly, make the next item easier. The fixed
step size is usually set somewhere between .025 and .05. This model
has been described and utilized by a number of investigators.69 70 71

The primary advantage of the branching model is its improved
measurement accuracy at the extremes of the performance continuum,
although the complement is also true; namely, poorer performance in the
median range. Its primary limitation is the large number of test
items required for any reasonably sized network, as well as the required
stability of the item difficulty indices, which must be spaced somewhere
between .025 and .05 apart for maximum efficiency. Simulation and empirical
studies (this model has received the most extensive empirical assessment)
indicate its superior outcome in comparison to conventional testing,
and it typically yields high correlations (r on the order of .80) with a
conventional test. The requirement for an exceedingly large pool of test
items with known difficulty indices will always be its greatest deterrent.
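
The fixed step rule and the item pool it implies can be sketched in Python.
The sketch assumes a difficulty scale on which larger values mean harder
items and a pyramid-shaped network in which stage k offers k distinct
difficulty positions; the function names and the .05 step are illustrative.

```python
def branching_path(responses, start=0.50, step=0.05):
    """Difficulty levels visited under a fixed step up-down rule.

    responses: iterable of booleans (True = correct).
    """
    offset = 0  # number of net upward steps taken so far
    path = [round(start, 3)]
    for correct in responses:
        offset += 1 if correct else -1
        path.append(round(start + step * offset, 3))
    return path

def pyramid_pool_size(stages):
    """Items needed when stage k of the network holds k distinct items."""
    return stages * (stages + 1) // 2
```

Under the pyramid assumption a student who answers only 10 items still
requires 55 calibrated items in the pool, which illustrates why the item
requirement is described as the model's greatest deterrent.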



M. Stocking, Short Tailored Tests, Research Bulletin 69-63,

2.3.4 Related Branching Models



If one adjusts the typical branching model for guessing, the
upward and downward movements can be reassigned
such that upward movements are smaller than downward
movements. 72 This is referred to by Lord as the H-L method.73
Simulation comparisons indicate that this is a less desirable model
than the branching model and that it shares that model's undesirable
features, namely, the requirement for a large item pool.

Another variant is the plicate method which adjusts the higher 
and lower branching steps according to whether the number right is odd 
or even, or some multiple thereof. Again, it has shown poor performance 
in comparison to the conventional branching model.

2.3.5 Hybrid Model 

This model combines the shrinking and fixed step size methods
into a single testing approach. For the first n test items, the step
size between item difficulties is reduced in a systematically decreasing
manner. Then for the remaining items, a fixed step size is utilized. The
main advantage derived lies in its efficiency in establishing an early
estimate of a student's performance level using the shrinking step size,
and then a refined estimate over very small incremental steps for the
remaining items. Again, its primary advantage is improved measurement
accuracy in the extremes of the performance continuum, but again it does
not resolve the problems for the median range. To date, only simulation
studies have been performed on this model.

2.3.6 Blocked Up and Down Methods 

In order to increase the stability of the tests, the blocked
up-down model combines several items of approximately equal difficulty
into a single block. If a student gets at least one half of the items
correct, he is moved up to the next difficulty level block test. If he fails
more than half the items, he is moved to an easier block test. Lord investigated



A. Birnbaum, "Some Latent Trait Models and Their Use in Inferring
this model utilizing two items per block. Results of the computer
simulation indicated that the blocked up and down model is inferior to
the single item branching model, although Cleary, Linn, and Rock reported
a very high correlation coefficient for this model
in their simulation of test outcomes, which used actual test scores derived
from known ability tests. 78

2.3.7 Multistage Models 

Following the suggestions of Cronbach and Gleser, tests can be
constructed so that the initial section provides a routing into three or
four performance levels.79 The second portion of the test, the measurement
section, is then administered to each of the subgroups and used to
derive a highly accurate score. Rock, Linn, and Cleary evaluated a two-stage
routing procedure, a broad range routing procedure, a group
discrimination routing procedure, and a sequential item sampling procedure.
The group discrimination procedure yielded the most satisfactory results.80
Subsequent work by Lord indicated that the Robbins Monro process yields the
most accurate estimates of a student's performance.81 It is
followed by the branching model, and the multistage model yields
satisfactory results over the full performance range. The biggest problem
associated with the multistage model lies in its original construction.
Obtaining items of appropriate difficulty and arranging them to achieve the
desired result is laborious and voluminous at best. On the other hand, the
model has obvious applications for a paper and pencil mode.



2.3.8 Flexilevel Model 



Created by Lord, the flexilevel model starts a student with a
middle difficulty item and proceeds by presenting the next easier item
after each wrong response and the next harder item after each correct
response. 82 Testing is stopped after n items, where n is defined as
(N + 1)/2 and N is the total number of items of the test. Lord found
through computer simulation studies that the flexilevel model yields
highly satisfactory results if the difficulty step size is in the range
.033 to .067.83 This model is quite advantageous for two reasons: first,
the reduction in test items is clearly specifiable; second, paper
and pencil applications are feasible. Moreover, the test item pool
can be directly implemented from an existing conventional test, a highly
important developmental factor.
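
The flexilevel routing rule and its stopping point of (N + 1)/2 items can be
sketched in Python. The sketch assumes an odd number of items already sorted
from easiest to hardest and reads "next easier" and "next harder" as the
nearest item in that direction not yet administered; the function name and
the callback used to collect responses are hypothetical.

```python
def flexilevel(items_by_difficulty, answer):
    """Administer a flexilevel test.

    items_by_difficulty: list of items sorted from easiest to hardest (odd length).
    answer: function taking an item and returning True if answered correctly.
    Returns the list of (item, correct) pairs in order of administration.
    """
    n_total = len(items_by_difficulty)
    if n_total % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")
    n_given = (n_total + 1) // 2          # stopping rule from the text
    middle = n_total // 2
    next_easier, next_harder = middle - 1, middle + 1
    current = middle
    administered = []
    for _ in range(n_given):
        item = items_by_difficulty[current]
        correct = answer(item)
        administered.append((item, correct))
        if correct:
            current, next_harder = next_harder, next_harder + 1
        else:
            current, next_easier = next_easier, next_easier - 1
    return administered

# Example: seven items ranked 1 (easiest) to 7; the student can do items up to 5.
example_record = flexilevel(list(range(1, 8)), lambda item: item <= 5)
```

In this example only four of the seven items are presented, which is the
clearly specifiable reduction in test length noted above.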

2.3.9 Summary of Tailored Testing 

The various problems raised by tailored testing discussed above
are summarized by Lord as follows: "Until now, even some very primitive
questions about how to carry out tailored testing did not have even vague
answers."84 If these problems are confusing even to the psychometricians,
how can the technical training sector have confidence in tailored testing?
A mature summary of problems and advantages indicates the wisdom of further
research and development.

In some of the studies reported (e.g., Angoff and Huddleston;
Cleary, Linn, and Rock), as many as 20% of the students were misclassified
by the routing test. 85 86 87 In the case of conventional testing,
misclassification of students is similarly unavoidable, since no training
test of today is perfectly valid and reliable. Given equivalent weakness
for each approach, the use of improved test development methodology is the
best course of action.



Another serious weakness of tailored testing is that although it
is better for the extreme ability groups, it provides less accurate
measurement for the average individual than that of a "standard" test.
Lord gave tailored testing an apparent "fatal blow" in this comment:



If, for example, 500 items are available for tailored
testing, better measurement will often be obtained by
selecting, for example, the n = 60 most discriminating items
(highest a) and administering these as a conventional test,
rather than by using all 500 in a tailored-testing procedure.
This may actually prove to be a fatal objection to any general
use of tailored testing.88

This remark would hold if tailored testing were applicable only to normative
ability measurement, such as the GRE or the SAT. However, in reaction to
this restricted viewpoint of tailored testing, Green argued that "the
computer's failure to improve on conventional testing in this situation
does not foreclose the possibility of computer advantages in other cases."89
A very similar opinion was shared by Crick, who reacted: "Lord's restricted
view of testing, while certainly a legitimate one, does not exhaust the
possible applications of computer-assisted testing."90

In discussing the prospects of tailored testing, it seems that 
the following points are pertinent: 

1. One reason for Lord's negative comment on tailored testing is the
strategy of comparison with a standard test (i.e., a conventional peaked
test). However, in comparing tailored testing with a "published"
test (Lord's definition of a conventional unpeaked test), his findings
indicated that "the tailored procedure gives more accurate measurement
than the unpeaked conventional test for all students regardless of level."91
Thus, in most technical training contexts, tailored testing is apparently
the most effective approach.

2. It has also been shown that tailored testing permits a drastic 
reduction of test items without much loss in the reproducibility of the 
total test scores. 92 93 94

3. One novel application was made by Ferguson, who used tailored
testing in a hierarchical criterion-referenced measurement situation.95
Concerning the potential usefulness of tailored testing for this purpose,
Crick commented: "Intuitively, tailored testing makes much more sense for
a criterion-referenced measure than for a norm-referenced measure since
the goal of tailored testing is to adjust the test to the individual."96

4. In individualized approaches to instruction, it seems that Lord's
flexilevel testing may have wide applicability. In the pretest, every 
subject would take the easy set of the items; but, in the posttest, the 
subjects would take the difficult set instead. Thus, the use of the 
parallel forms of the test can be avoided. Furthermore, since the sub- 
jects would not have been exposed to many of the harder items, the carry- 
over effects of testing can be minimized. Although Lord developed the 
flexilevel testing, he has not emphasized the use of it in this context. 

5. Tailored testing is appropriate also in the affective domain of
measurement.97 Tam found that a flexilevel model yielded reliability and
validity indices equivalent to the total conventional test, and an
empirically observed stop criterion reduced the test length significantly
beyond the 50% level.

The prospects of tailored testing depend on willingness to explore 
its various uses, and the above list is by no means exhaustive. It is 
hoped that more rigorous explorations of tailored testing will lead to 
Green's prediction of the "inevitable computer conquest of testing."98

2.4 Adaptive Testing for Hierarchical Learning Structures 

For the instructional tasks which are hierarchical in nature,
special adaptive testing techniques are required due to the known
interdependencies. The term "hierarchy" is here used in the sense described
by Gagne, based on his taxonomy of learning.99 Gagne proposed that the
prerequisite skills for a terminal objective can be analysed so that
lower ordered skills or behaviors would generate positive transfer to
higher level skills. Gagne's method of analysis begins with the terminal
objective, and reiterates the following question for each subbehavior
(subskill) identified: "What would an individual already have to know how
to do in order to learn the new capability simply by being given verbal
instructions?"

A number of instances have been reported by both Gagne and
Glaser and Nitko in which task analysis procedures have been applied to
the study of curricula structures.100 101 These include applications
dealing with number series, algebraic equations, and elementary
geometry. 102 103 104 Others include operations with sets, fractions,
punctuation, and capitalization of words and reading. 106 107 Glaser
and Nitko further point out that task analysis is a growing area of
activity among educational researchers.108

An initial task analysis results from a rational and subjective
procedure, usually performed by curriculum experts. Because of this, it
constitutes merely a hypothesized set of relationships involved in the
subject matter. If a great deal of faith is going to be put into a
hierarchy resulting from a task analysis, empirical validation of the
hierarchy becomes necessary. Gagne proposes a simple analysis of data
collected on criterion tests referenced to each and all objectives within
the hypothesized hierarchy. The subjects on whom data have been collected
are first categorized into those who have passed the higher unit and those
who passed the lower unit. Implications for the validity of the hierarchy
can be made through the simple comparison of the groups.109

Applications of techniques for validation purposes include the
works of Hively and Schutz, Baker, and Gerlach mentioned earlier, as well
as those studies conducted by Gagne and his colleagues. 110 111 Further
applications were demonstrated by Newton and Hickey; Smith and Moore; and
Cox and Graham. 112 113 114

Another approach to the validation of task analysis attempts to
derive from empirical data statistical indices which can then be used to
evaluate the hypothesized hierarchy. These procedures are extensions of
the methods of scalogram analysis and simplex analysis.115 Applications
have been demonstrated by Resnick; Resnick and Wang; and Boozer and
Lindvall. These last researchers concluded that, in general,
scalogram analysis seemed more applicable to within-objective hierarchies
whereas simplex analysis seemed more appropriate for between-objectives
relationships. Okey has presented an overview of the literature on
validation of hierarchies.

Hierarchically based instruction, then, has been or can be
developed so that tutorial, drill and practice, etc., materials cover
each subskill, under the assumption that acquisition of all subskills
or necessary behaviors is prerequisite to performing the terminal
behavior. That is, based on Gagne's theory of a learning hierarchy, if
a student can perform the terminal skill, he is also capable of performing
the subordinate skills (or has mastered the subordinate skills simultaneously
with the terminal objective).

The implications for testing, or for integrating testing with
instruction, are in the location of instructional dependencies, and in
maximizing the opportunities of the student to participate in training
only in those areas in which he does not already have competencies.
Pretesting over all objectives in a course can be predicted to be
prohibitively lengthy, even though computer capabilities permit total
pretesting and then branching to the lowest level objective unmastered,
either for finer grained testing, or for instruction and drill, testing for
mastery, and movement to the next objective not mastered. Taylor describes
a model for integrating testing and instruction which precludes unnecessary
testing on questions which can be assumed, from hierarchical placement, to
be unmastered by the learner.121 In each section of the training sequence,
pretest items related to the terminal objective are presented first. The
student who answers all items for the terminal objective correctly is 
branched on to the next section. If he fails one of the terminal objective 
pretest items, however, he is presented with pretest items for the subordi- 
nate skill. If he then fails to respond correctly to all items in this 
pretest, he is branched to the pretest for the lowest level subordinate skills. 
If he answers 100% of these items correctly, he is routed to the next level 
subordinate skills; if he does not answer 100% correctly, he is immediately 
given instruction on the skills, and is led through a number of drill and 
practice problems on the skill. The number of items presented for drill 
varies with the performance of the learner. If he answers 80% or more of 
the problems correctly on the first attempt, he is moved on to a new topic. 
Otherwise, he is branched back to the beginning of the instructional 
sequence. When he displays mastery, he is tested on the next higher skills, 
and so on. Eventually, he is tested again on the terminal objective.

In a study by Taylor using these techniques, the 300 plus students 
who, on entering the integrated instruction/testing program, answered all 
33 pretest items correctly, were branched past all instructional sequences. 
Thus they were able to complete the instructional program in approximately 
20 minutes, as opposed to several hours of possible testing, drill, and 
practice for the student in the nonmastery situation. 
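
The time saving comes from testing from the top of the hierarchy downward.
The Python sketch below is a simplified illustration of that idea, not
Taylor's exact routing: it assumes the levels are listed from the terminal
objective down to the lowest subordinate skill, that passing a level implies
mastery of every level beneath it, and that pretesting proceeds one level at
a time. The function name and the pretest callback are hypothetical.

```python
def levels_needing_instruction(levels_top_down, passes_pretest):
    """Return the levels to be taught, ordered from lowest to highest.

    levels_top_down: level names from the terminal objective down to the lowest skill.
    passes_pretest: function taking a level name and returning True if the
                    student answers all pretest items for that level correctly.
    """
    failed = []
    for level in levels_top_down:
        if passes_pretest(level):
            break  # everything below is assumed mastered; stop pretesting
        failed.append(level)
    # Instruction starts with the lowest unmastered level and works upward.
    return list(reversed(failed))

# Example with three hypothetical levels.
levels = ["terminal", "intermediate", "basic"]
plan_for_novice = levels_needing_instruction(levels, lambda lvl: False)
plan_for_expert = levels_needing_instruction(levels, lambda lvl: True)
```

A student who passes the terminal pretest receives an empty instruction plan
and is branched past the section after a single pretest, which corresponds
to the short completion times reported for the students who answered all
pretest items correctly.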

Ferguson utilized a sequential testing model to move students 
through an elementary mathematics hierarchy.122 Grade level was utilized
as the entry prediction and placement. The operations were judged to be 
satisfactory although lack of comparative data prevented an evaluation. 

For technical training, the optimal prediction and entry plus a 
flexible (both upward and downward) movement would be required. In addition, 
the prediction of potential transfer (synthesis of subskills) would allow 
for pretesting and training time savings if successful. Student directed 
decisions might be important in this transfer prediction process due to 
self-awareness and confidence levels. 

2.5 Scoring, Diagnosis, Interpretation, and Reports 

For this highly important third process of adaptive testing, limited 
research findings (theoretical, simulated, or empirical) have been reported. 
The historical reviews above subsume the preponderance of work to date.
Therefore, this section will focus on promising topics of further study. 

Most scoring procedures have utilized the dichotomous right-wrong summed
score. Three promising alternatives appear to be feasible. First, one
could differentially weight items so that the most discriminating items
relative to the criterion decision zone, rather than the total score, have
the most decisive influence. Studies of item weighting indicate that
weighting can improve decision making as well as test psychometric
characteristics. 123 124 125 Thus alternative weighted scoring procedures are
promising and feasible given a computer's calculation capacity.

In turn, the aggregating or summation process for the total score
should be studied. Green posits that a mean of difficulty indices for
correct responses offers the most accurate procedure.126 Similar composite
score procedures that stress minimally acceptable mastery levels
should be investigated.
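
The scoring rule attributed to Green can be written as a short Python
sketch. It assumes each administered item carries a difficulty index on a
scale where larger values mean harder items; that convention, the function
name, and the choice to return None when no item is answered correctly are
assumptions made for illustration.

```python
def mean_difficulty_score(administered):
    """Score a tailored test as the mean difficulty of the items answered correctly.

    administered: list of (difficulty_index, correct) pairs.
    Returns None when no item was answered correctly.
    """
    correct_difficulties = [d for d, ok in administered if ok]
    if not correct_difficulties:
        return None
    return sum(correct_difficulties) / len(correct_difficulties)

# Example: two items answered correctly, one harder item missed.
example_score = mean_difficulty_score([(0.4, True), (0.6, True), (0.8, False)])
```

Unlike a number-right score, this score remains comparable between two
students who were routed through different sets of items.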

Finally, there is important information in the error responses
elicited from students. Bock proposes an item estimation procedure that
yields differential information from error alternatives.127 Intuitively,
a "nearly correct" response is more adaptive than a "dum-dum" response.
In turn, these error patterns may yield highly important differential
categories of students who have partial knowledge. For one group, the
remedial alternative of test item review would be sufficient to achieve
mastery, while the other extreme group may achieve mastery only through a
totally new training strategy. Large student flow and a computer are
required to implement the Bock model; fortunately, Air Force technical
training satisfies these requirements.

In terms of diagnostic requirements, total test scores and item
pass-fail indices are far too summarized for instructional inference making.
Measurement in technical training should yield an individual performance
profile that indicates the structure and "valley" of weakness. Profile
techniques could yield insights like "the verbal indices are so low that
only a high multimedia with audio training approach will insure mastery,"
or "the uniform pattern of indices indicates that incentives to enhance
motivation will insure fast mastery." While speculative in nature, the
individual performance profiles interface directly into an adaptive
instructional model at this operational juncture.

Interpretation of adaptive tests can be viewed as a "clinical vs.
actuarial" challenge. As sufficient test data bases are collected, 
refined classification techniques (discriminant analysis) and statistical 
decision models can be constructed so as to improve the predictive aspects 
of the interpretation. While a futuristic form of research, the ultimate 
requirement should be investigated so as to have the full potential of 
adaptive training (instruction and testing) achieved. 

In regard to reports, the recurrent problem of instructors, supervisors,
etc., understanding numerical or statistical outputs will
be present. Graphical and verbal reports should be considered and studied.
The sufficiency of information for instructional decision making and
monitoring is critical. As cited in the Hansen, Hedl, and O'Neil review,
automation of the report process is both feasible and desirable in terms
of cost and resource utilization. A consumer survey methodology could be
profitably employed at this stage. Obviously, adaptive tests will only be
useful to the degree that their results are utilized in a sound, rational
manner.

III. A DESIGN FOR VALIDATION

3.0 Implementation and Demonstration of Adaptive Testing



This section will describe the project team's experience in 
implementing and demonstrating the feasibility of adaptive testing 
paradigms. This involved four major activities, in sequence: analysis 
of two Air Force technical training courses; identification of course 
sections appropriate for feasibility/validation study; programming of 
the computer; and design of appropriate follow-on validation and 
research studies. After an overview of the first three activities, 
these follow-on studies will be described in detail. 

3.1 Overview 

The initiating activity consisted of an extensive literature 
search relating to all facets of adaptive testing. This eventuated 
in the literature review presented in the prior section. After appro- 
priate consideration of the state of the art, it was recognized that 
not all of the fruitful topics raised in the literature search could 
be implemented given the constraints to be described. Therefore, 
priorities arranged according to benefits for adaptive testing appli- 
cations in technical training were delineated and used in the design 
process. The constraints confining the project in turn delimited the 
scope and priority structure of the proposed studies. 

3.1.1 Priorities. In reference to priorities, it was the project 
team's judgment that the potential savings in test length, with its 
concomitant reduction in measurement time, was first and foremost in 
importance. Secondly, demonstrating that computers can improve the 
accuracy of the testing process, reduce instructor involvement, and 
increase the information available for critical decision making 
was judged essential to establishing the feasibility of a computer-based 
adaptive testing approach. 

As a third priority, the application of adaptive testing to 
hierarchical learning structures was considered highly important in 
terms of its potential implications for savings in training time. 
Fourth, the use of adaptive testing for affect, or course critique 
activities, was considered feasible within the time constraints, and 
potentially demonstrative of the breadth of the adaptive testing approach. 
Finally, the study of the marginal student (the criterion zone decision 
making problem) was judged to be critical to the technical training 
mastery learning question. These priorities are ranked according to 
importance and guided the design efforts in the layout of the proposed 
three studies. 

3.1.2 Constraints. As in all naturalistic situations, numerous 
constraints shape the nature and scope of research activities. First, 
the student flow and associated time limitations strongly influenced 
the size of the endeavors. Secondly, available resources, especially 
computer terminals, influenced the team's approach and scope. Finally, 
the consideration of a strategy least interruptive to the conventional 
ongoing training affected the nature of the design. In total, while the 
constraints shaped the initial feasibility study, the demonstration, and 
subsequently proposed studies, they did not prove to be insurmountable 
barriers, and they illustrate the manner in which adaptive testing can 
be readily introduced into ongoing technical training programs. 

3.1.3 Demonstration. The variable entry flexilevel testing paradigm, 
the course critique assessment, and the hierarchical structure paradigm 
were computer programmed and are currently available on the University of 
Illinois PLATO system. Given ongoing course revisions, test items are 
undergoing changes on a frequent basis. However, the basic structure of 
the tests is coded, and the tests are presently running as will be reviewed 
under Studies One and Two. As will be described, the team's experience indi- 
cated that adaptive testing is an easy measurement process to implement on 
a large general purpose computer with a viable operating system and train- 
ing-oriented language. The demonstration was therefore judged highly 
successful. 



3.2 Air Force Course Analysis and Liaison Activity 

Concurrent with the literature search for adaptive testing, the 
project staff performed a task analysis on the Inventory Management (IM) 
course, and the Precision Measuring Equipment Specialist (PME) course. 
Given prior task analyses, this process was greatly facilitated. 
After appropriate consideration of priorities and topics, IM end-of-block 
III exam and lesson and Block IV exams were selected for demonstration of 
criterion-oriented adaptive testing. To give the reader some understanding 
of the structure of this material, the following IM course tasks can be 
listed: 



1. Define equipment item terms. 

2. Define Air Force Equipment Management System (AFEMS) abbreviations. 

3. Identify AFEMS organization responsibilities. 

4. Identify AFEMS chain of command. 

5. Obtain higher level approval of request for equipment authorization. 

6. Validate at base level request for equipment authorization. 

7. Establish Equipment Authorization Inventory Data (EAID)/in-use 
detail record. 

8. Identify procedures for equipment issue. 

9. Process equipment turn-in. 

10. Process intercustody receipt account transfer. 

11. Process issue and turn in of vehicles. 

12. Issue and turn in nonexpendable organizational items. 

13. Issue and turn in nonexpendable personal retention items. 

14. Issue and turn in nonexpendable tools. 

15. Issue and turn in expendable individual equipment and tools. 

16. Identify equipment inventory preparation procedures. 

17. Identify procedures/steps required to conduct a physical inventory. 

18. Identify procedures for processing a consolidated inventory 
adjustment document. 

19. Identify inventory procedures for vehicles and family quarters. 

20. Identify Industrial Plant Equipment (IPE) items and procedures. 

21. Identify War Readiness Materials (WRM) procedures. 

22. Identify nature and purpose of the EAID/in-use asset report. 

For the hierarchical learning structure, Block I of PME was selected. 
This block in essence presents the mathematical concepts and skills 
necessary for this highly technical, electronics-oriented course. Topics 
for this course can be subsumed in the following: 



1. Applied Mathematics 

2. DC Circuit Analysis 

3. AC Circuit Analysis 

4. Vacuum Tubes and Solid State Principles and Power Supplies 

5. Solid State and Vacuum Tube Amplifiers 

6. Wave Generating and Shaping Circuits 

7. Test Equipment Troubleshooting and Repair Procedures 

8. DC and Low Frequency AC Measurement I 

9. DC and Low Frequency AC Measurement II 

To provide course liaison and cooperative design and demonstration 
activities, the research team met with course instructors and supervisory 
personnel from the IM course and the PME course during the third week of 
July, 1973. For the IM course, arrangements were made to collect the 
end-of-block tests for Blocks III and IV and the criterion progress checks 
(CPC's) for lessons 1, 2, 8, 9, and 10 of Block IV. The block tests were 
50-item multiple choice tests covering about two weeks of instruction. 
The CPC's consisted of a combination of short answer and form completion 
items. The CPC's covered a shorter period of instruction than the block 
tests, and were used mainly to assess performance; for example, form com- 
pletion. Satisfactory performance on the CPC's is a prerequisite for 
taking the block examinations. Only those CPC's were selected that were 
suitable for implementation on the PLATO system; that is, one word or 
short answer formats. Given the reliability problems of the PLATO system, 
graphic displays and natural language processing were not included in this 
research effort. Logistics of test collection, storage, and security were 
discussed with Air Force personnel. A procedure was arrived at that 
interfered least with the normal administration of the course: 

1. Block tests and CPC's were turned over to the civilian head of the 
course by the instructor responsible for coordinating test collection. 

2. Tests were stored, and periodically forwarded to the research team 
for item analysis. 

3. All forms of the tests were collected, as well as the course critique 
items, for entry into the PLATO system. 

For the PME course, similar data collection procedures were 
arranged. The test items were also entered into the PLATO system. The 
same course critique items were also planned to be utilized for the PME/ 
hierarchical learning structure experiment. 

3.2.1 Item parameter analysis. In order to perform item parameter 
analysis, the block test data had to be transferred from the standard 
Air Force answer sheets to the IBM mark sense sheets. The test data 
were transferred by assistants and appropriate quality control procedures 
were followed. In order to carry out the modified flexilevel testing 
procedures, certain statistical information on the test items was required, 
i.e., item difficulties and beta weights for predictions. Two types of 
analysis were performed: item parameter estimation and stepwise multiple 
regression. Item analysis of the data was performed by a computer program 
that allows for the following: 

1. Specification of either norm- or criterion-referenced evaluation. 

2. Three types of correlations: 

(a) point-biserial correlation of item scores with total score, 

(b) biserial correlation of item scores with total score, 

(c) phi correlation of item scores with test score above or below 
criterion or median. 

3. Selective efficiency: the biserial or point-biserial correlation 
between item scores and test scores, corrected for the effect of 
item difficulty. 

4. Reliability: 

(a) norm-referenced (KR-20), 

(b) criterion-referenced. 

5. Output: 

(a) the above correlations for correct and incorrect alternatives, 

(b) frequency distribution of raw scores, 

(c) descriptive statistics for each item and all items combined, 

(d) a list of items sequenced according to difficulty and 
discrimination index, 

(e) a list of students by name or ID number with raw score, percent 
correct, percentile, and standard score (T-score), 

(f) feedback to each student regarding performance relative to 
the criterion. 

An example of this output is supplied in Appendix A, Computer Program 
Output. 
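
The item statistics listed above can be illustrated with a short Python sketch. It is not the project's analysis program, which is not reproduced in this report; the function names and the small response matrix are hypothetical, and population (divide-by-N) variances are assumed throughout. It computes item difficulty, the point-biserial correlation of an item with total score, and the KR-20 reliability from a matrix of scored (0/1) responses.

```python
from math import sqrt

def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly."""
    return sum(r[item] for r in responses) / len(responses)

def point_biserial(responses, item):
    """Point-biserial correlation of a 0/1 item score with total score."""
    totals = [sum(r) for r in responses]
    n = len(responses)
    passed = [t for t, r in zip(totals, responses) if r[item]]
    p = len(passed) / n
    mean_all = sum(totals) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    if p in (0.0, 1.0) or sd_all == 0:
        return 0.0  # undefined in these cases; reported as 0 in this sketch
    mean_pass = sum(passed) / len(passed)
    return (mean_pass - mean_all) / sd_all * sqrt(p / (1 - p))

def kr20(responses):
    """Kuder-Richardson formula 20 (assumes nonzero total-score variance)."""
    k = len(responses[0])
    n = len(responses)
    totals = [sum(r) for r in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    pq = 0.0
    for i in range(k):
        p = item_difficulty(responses, i)
        pq += p * (1 - p)
    return k / (k - 1) * (1 - pq / variance)

# Hypothetical data: five examinees (rows) by four items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for i in range(4):
    print(i, item_difficulty(responses, i), round(point_biserial(responses, i), 3))
print("KR-20:", round(kr20(responses), 3))
```

Sorting the items by the difficulty values so obtained gives the ordered item list that a flexilevel procedure requires.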

Since course critique items were to be flexilevel administered, the 
standard Air Force format could not be employed in all cases. In the place 
of the standard Air Force format, four 21-item critiques were developed by 
the research team. Each of the new course critiques centered on only one 
aspect of course instruction, such as instructor effectiveness. 

In reference to the predictive entry requirement of the test, 
appropriate stepwise regression analyses were performed on a preliminary 
group of 100 students. An available "canned" program supplied the 
following:134 



1. Summary statistics on each variable. 

2. Correlation matrix. 

3. At each step, 

(a) analysis of variance table, 

(b) multiple R and R square, 

(c) beta weights, 

(d) partial correlations, and F and tolerance values for variables 
not in the equation. 

4. Summary table for all variables in the equation. 

Using the above regression procedure, it was possible to generate a 
predictive model based on the following data: 



1. Class standing 

2. AFQT score 

3. AQE administrative score 

4. AQE general score 

5. AQE electrical score 

6. AQE mechanical score 

7. Block I written score 

8. Block II written score 

9. Block III written score 
10. Block IV written score 

Preliminary examples of these relationships are presented in Table 3. 
As sample size grows sufficiently large, more stable estimates of these 
predictions will be found. In addition, ongoing study of fluctuations 
in the beta weights, as successive cases are added, will assess 
variability and measurement error. 
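
To make the predictive entry idea concrete, the following Python sketch fits an ordinary least-squares equation and uses it to predict a block score from two predictors. It is a plain least-squares fit, not the stepwise procedure of the canned program, and it yields raw regression weights rather than standardized beta weights. The function names and all numbers are invented for illustration, and the predictors are assumed not to be perfectly collinear.

```python
def fit_least_squares(X, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = []
    for i in range(k):
        line = [sum(r[i] * r[j] for r in rows) for j in range(k)]
        line.append(sum(r[i] * yi for r, yi in zip(rows, y)))
        A.append(line)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, x):
    """Predicted score for one student's predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))

# Hypothetical predictors (for example, an aptitude score and a prior
# block score) and hypothetical criterion scores for five students.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [6.0, 8.5, 13.5, 15.5, 21.0]
coefs = fit_least_squares(X, y)
print([round(c, 3) for c in coefs])
print(round(predict(coefs, (3, 4)), 3))
```

Refitting as each new group of students is added, and comparing the successive weights, is one simple way to watch the fluctuations described above.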

3.3 Study One—Flexilevel Validation Study 

The purpose of this study is to validate the adaptive testing 
paradigm (predictive entry and tailored item presentation) for both a 
knowledge-oriented test and a student-evaluated course critique. The 
primary goal of the paradigm is a significant reduction in testing time. 
Using a within-subject design (N = 200 or more), each student will be 
individually entered in the test and given the flexilevel adaptive item 
movement procedure. After the student completes the adaptive test, all 
of the remaining items will be presented. Conventional and adaptive forms 
of the course critique will be given at the ends of Blocks III and IV. 

The independent variables will be (a) individualized entry based 
on regression techniques, using AFQT, clerical, Block I, and Block II scores, 
(b) the flexilevel algorithm with its final score, and (c) the adaptive 
course critique form. The dependent measures will be (a) conventional 
total scores, (b) item latencies, and (c) total test times. For analysis, 
correlational techniques should reveal that the correlation between the 
adaptive score and the total score is greater than .9, and that the predictive 
entry yields a correlation of .70 or greater with the above two variables. 

Utilizing Lord's prediction, there should be an approximate 50% 
reduction in test items with a 40% reduction in test time for the adaptive 
test. (Analysis of variance techniques will be used.) For item latencies, 
three significantly different distributions can be predicted: blocking 
data at .05 difficulty about the final adaptive score, the very hard items 
will have longer latency than items at the expected performance level, which 
will in turn have longer latency than the easier items. For the course 
critique, there will be a 
high relationship between the two forms, and instructor feedback should be 
more positive for the adaptive form. 
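
The correlational targets stated for this study can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the score lists are invented, and the thresholds are simply the .9 and .70 figures given above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def meets_targets(adaptive, total, predicted):
    """Compare observed correlations with the targets stated in the text."""
    return {
        "adaptive_vs_total": pearson_r(adaptive, total) > 0.90,
        "predicted_vs_adaptive": pearson_r(predicted, adaptive) >= 0.70,
        "predicted_vs_total": pearson_r(predicted, total) >= 0.70,
    }

# Invented scores for six students.
adaptive = [12, 15, 18, 20, 22, 25]
total = [24, 31, 35, 41, 43, 49]
predicted = [14, 14, 19, 18, 23, 24]
print(meets_targets(adaptive, total, predicted))
```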

Conventional reliability assessment will be applied to all test 
forms at both item and form levels. 

3.3.1 Computer implementation. The total test paradigm has been 
programmed in the TUTOR language of the PLATO system. From a student point 
of view, the procedure runs as follows: (a) the student signs on the 
computer terminal, (b) he enters control processing, (c) the system selects 
the test and entry level for him, and (d) it executes the adjusted flexilevel 
item presentation which will assess his performance. After he has completed the 
adaptive portion of the test, all remaining items are presented. If he has 
not achieved mastery based on standard Air Force scoring procedures, he 
receives offline remediation. If he has demonstrated an acceptable level 
of performance, the system then decides whether to (a) assign the next 
flexilevel test, reenter the student in control processing and once again 
begin the flexilevel sequence, or (b) sign him off, an option available to 
the instructor for acceleration. Figure 1 presents a flowchart of a 
student moving through each of these steps. A more detailed description 
follows. 

In signing on, the student enters his name and the computer executes 
a security check designed to limit system accessibility and assure test 
security. Once he has completed the required sign-on activities, the com- 
puter system checks his performance record and aptitude profile to determine 
which of the 10 tests he is ready to take. The system also determines his 
entry level in the chosen test. Thus, the student is provided the most timely 
entry test point in terms of his recorded performance, aptitudes, and 
current in-course status. 

Student readiness indices would include previous training activities, 
courses completed, formal education, and other objective training indices. 
His aptitude profile might include his test scores on the AFSC, ASVB, and 
other Air Force standardized aptitude tests. His current instructional 
status identifies how far along he has gotten in the course. Together 
these data enable control processing to almost instantaneously compute a 
predictor equation based on these variables. 

Once the predictor equation is determined, the computer system 
translates it into an appropriate flexilevel test whose difficulty and 
scope are adjusted to the student's predicted performance. He is therefore 
provided an evaluation experience individually tailored to his current 
status. He executes this test on the computer terminal, a useful medium 
not only because of its rapid response but also because of its transitory 
display, which augments test security. 

The student enters the test at the difficulty level that has been 
predicted as appropriate. If he misses an item, he continues down the 
difficulty scale until he gets one correct. This establishes his in-test 
performance base, from which subsequent flexilevel items originate. 
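
The item movement just described can be sketched in Python. This is an illustration under stated assumptions, not the TUTOR program used on PLATO: items are assumed to be indexed from easiest (0) to hardest, the movement rule is the usual flexilevel one (after a correct answer, the next harder item not yet given; after a miss, the next easier item not yet given), and the stopping length max_items is a hypothetical parameter. All names are invented.

```python
def adjusted_flexilevel(n_items, entry, answer, max_items):
    """Administer a variable-entry flexilevel sequence.

    n_items   -- number of items, indexed 0 (easiest) to n_items - 1 (hardest)
    entry     -- index of the first item, taken from the predicted score
    answer    -- callable: item index -> True if answered correctly
    max_items -- hypothetical cap on the number of items administered
    Returns a list of (item, correct) pairs in order of administration.
    """
    given = set()
    order = []
    item = entry
    while item is not None and len(order) < max_items:
        correct = answer(item)
        given.add(item)
        order.append((item, correct))
        # Move up after a correct answer, down after a miss,
        # skipping items that have already been administered.
        step = 1 if correct else -1
        nxt = item + step
        while 0 <= nxt < n_items and nxt in given:
            nxt += step
        item = nxt if 0 <= nxt < n_items else None
    return order

# Hypothetical examinee who can answer items 0 through 4 of an 11-item test
# and is entered too high, at item 7.
sequence = adjusted_flexilevel(11, 7, lambda i: i <= 4, max_items=6)
print([i for i, _ in sequence])  # [7, 6, 5, 4, 8, 3]
```

Under this single rule, a student entered too high first steps down until he answers an item correctly, and the sequence then alternates around that point, which is the in-test performance base referred to above.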

When he has completed the adjusted flexilevel test, the remaining 
test items are presented and the student responses are evaluated (see 
Figure 1). Green's scoring procedure will be used to evaluate the 
flexilevel portion of the test, while the entire test will be evaluated using 
standard Air Force performance criterion scoring procedures.135 Thus, for 
each student a tailored test score and a conventional test score will be 
available. If full mastery, based on the entire test score, is achieved, 
the student is provided the opportunity to take the next lesson. If he 
elects to, he then reenters control processing and begins the same sequence 
in the next assigned flexilevel test. 

In the case of test failure, the student goes offline for course 
remedial activities keyed to his learning deficiencies. Following remedi- 
ation, all students reenter control processing and restart the flexilevel 
testing cycle. 



135 See footnote 89 on page 29. 

Figure 1. Flowchart of student progress through flexilevel testing 
program. 

After the student attains performance mastery, as a result of 
either the initial or post remedial 50-item test score, the system then 
decides if he should continue to the next test. If time permits, he 
most likely will be routed to control processing for a performance 
prediction update and subsequent testing. If further testing is not 
prescribed, he is signed off.136 

3.4 Study Two—Hierarchical Learning Assessment 

The hierarchical learning study will be performed within Block I, 
Precision Measuring Equipment Specialist (PME) course. Using a within- 
subject design (N = 100) for validation purposes, an individualized entry 
prediction based on regression techniques (AFQT, math, mechanical, elec- 
tronic, and prior math instruction) and a pretest pass-fail unit movement 
will be the independent variables. All students will be required to attempt 
all test items with embedded flexilevel branching and subsequent full 
testing. The dependent measures of total test score, item latencies, and test 
time will be comparatively analyzed with the adaptive test measures as 
described in Study One. Besides the reliability analyses, learning time 
and patterns of unit performance levels will be evaluated. The Gagne pass- 
fail matrix techniques will be utilized. 

3.4.1 Computer testing paradigm. The hierarchical testing paradigm 
is a strategy designed to minimize testing time while maintaining accurate 
assessment of the student's level of mastery. Also, it includes prescribing 
the level of instruction which is most appropriate to the student's real 
time performance status. 

A flowchart of the hierarchical testing paradigm, Figure 2, is 
included to indicate and clarify the conceptual steps in implementing 
this strategy. The flowchart illustrates how a student is introduced to 
the testing paradigm after instruction up to lesson N. An appropriate 
test is initially administered. If the outcome indicates extreme results, 
the student's level of mastery is significantly different from what was 
anticipated; consequently a more appropriate test is administered. When 
a test which is neither too advanced nor too elementary is administered, 
the student's response is characterized by moderate performance. At this 
point, information from all testing and previous performance is summarized 
to prescribe either (a) remediation, or (b) the next hierarchical lesson, 
or (c) more advanced instruction. Each of these processes is described 
below. 

1. After the student signs on to the computerized testing system, 
the system analyzes his aptitudes, performance, and current level of 
instruction. It then prescribes, using regression-determined prediction 
equations, the test which is at a level most appropriate for the antici- 
pated performance of the student. 
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^^^Technical documentation for the described flexilevel testing is 
available to the interested reader on request, by contacting the Air Force 
Human Resources Laboratory, Lowry Air Force. Base, Colorado 80203. 
2. A traditional flexilevel midpoint entry test is administered 
at the predicted level of mastery, and scored using Green's algorithm. 

3. An extreme failure indicates that the student's level of mastery 
is lower than expected. 

4. If the student is not at the bottom of the hierarchy, but is 
lower than predicted, a lower test is available and is prescribed. If 
the student is at the lowest level in the hierarchy, no lower test will 
exist, so remediation is administered to raise his level of performance. 

5. An extreme success indicates that the student's level of mastery 
is higher than expected, and a readjustment is in order. 

6. If the student is not at the top of the hierarchy, but is higher 
than predicted, a higher test is available and is prescribed. If the 
student is at the top of the hierarchy and has had an extreme success, he 
has mastery of all the material and can be released from further instruc- 
tion in this area. 

7. If the student has done neither extremely poorly nor extremely 
well, it is clear that his level of mastery is close to that reflected 
in the test just received. With this information and previous per- 
formance data, a decision about appropriate instruction is made with 
high confidence. 

8. If the level of performance is shown to be below what is needed 
to proceed to the next lesson, remediation is administered and the student's 
skills are reassessed through the testing procedure. 

9. If, on the other hand, the student has demonstrated proficiency 
at a higher level than that of the next lesson, he is assigned to the 
most appropriate level (with potential for considerable savings of 
instruction time). 

10. After all is considered, if the student indicates his performance 
is neither behind nor ahead of the expected level, given his amount of 
instruction, the next lesson in normal sequence is assigned. 

11. With an accurate assessment of the level of mastery, and the 
appropriate assigned lesson of instruction, the student leaves the 
testing environment and resumes instruction with the minimum amount of 
unnecessary testing or instruction. 
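
The eleven steps above amount to a small decision procedure, sketched here in Python. The sketch is an interpretation, not the programmed paradigm: the cutoffs for "extreme" failure and success (0.3 and 0.9), the level numbering from 0, and all names are hypothetical. The rule that stops the search when the direction of movement reverses is an added assumption, included only so that the loop is guaranteed to end.

```python
def hierarchical_placement(predicted_level, expected_level, top_level,
                           give_test, low=0.3, high=0.9):
    """Return (decision, level) for one student.

    predicted_level -- level of the first test, from the prediction equation
    expected_level  -- level the student should hold, given instruction so far
    top_level       -- highest level in the hierarchy (lowest is 0)
    give_test       -- callable: level -> proportion correct on that test
    low, high       -- hypothetical cutoffs for extreme failure and success
    """
    level = predicted_level
    last_move = 0
    while True:
        score = give_test(level)
        if score <= low:
            move = -1                      # extreme failure (steps 3 and 4)
        elif score >= high:
            move = 1                       # extreme success (steps 5 and 6)
        else:
            break                          # moderate performance (step 7)
        if move == -1 and level == 0:
            return ("remediate", 0)        # no lower test exists
        if move == 1 and level == top_level:
            return ("release", top_level)  # mastery of all material
        if last_move == -move:
            # Assumed guard: passed one level, failed the next; settle lower.
            level = min(level, level + move)
            break
        level += move
        last_move = move
    if level < expected_level:
        return ("remediate", level)        # step 8
    if level > expected_level:
        return ("advance", level)          # step 9
    return ("next_lesson", expected_level + 1)  # step 10

# Hypothetical student with moderate performance at the predicted level.
print(hierarchical_placement(3, 3, 8, lambda lvl: 0.6))  # ('next_lesson', 4)
```

In practice the decision at steps 8 through 10 would also draw on previous performance data, which this sketch leaves out.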

3.5 Study Three--Criterion Zone Decision Study 

Using the results of Studies One and Two, appropriate critical 
zone criteria will be developed for identifying marginal students. 
(Subsequent performance, that is, pass-fail patterns, will be used to 
establish these criteria.) The adaptive computer test will be reorganized 
so that students scoring in this zone will be randomly assigned to either 
(a) a Robbins-Monro critical zone sequential test, or (b) analysis of 
error alternatives using the Bock procedure. Ideally, this study would 
be performed in both IM and PME. Depending on student flow, at least 
50 students in each condition would be compared. Analysis will focus 
on performance in both the next unit and block. Estimates of time 
savings will be based on percentage of students in the critical zone and 
savings on nonwashbacks. Reliability will be based on the performance 
on the subsequent test. 
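
The assignment step of this design can be sketched as follows. The sketch covers only the identification of marginal students and their random assignment to the two conditions; it does not implement the Robbins-Monro or Bock procedures themselves. The mastery cutoff, the half-width of the critical zone, and all names are hypothetical, since the zone criteria are to be derived from Studies One and Two.

```python
import random

CONDITIONS = ("sequential_test", "error_analysis")

def in_critical_zone(score, cutoff, half_width):
    """True when a score lies within half_width of the mastery cutoff."""
    return abs(score - cutoff) <= half_width

def assign_conditions(scores, cutoff, half_width, seed=0):
    """Randomly assign each marginal student to one of the two conditions.

    scores -- dict mapping student ID to proportion correct
    Returns a dict covering only students inside the critical zone.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {
        student: rng.choice(CONDITIONS)
        for student, score in scores.items()
        if in_critical_zone(score, cutoff, half_width)
    }

# Invented scores; cutoff and zone width are placeholders.
scores = {"a": 0.50, "b": 0.69, "c": 0.72, "d": 0.95}
print(assign_conditions(scores, cutoff=0.70, half_width=0.05))
```

Simple random choice does not guarantee equal group sizes; a design requiring at least 50 students per condition might instead alternate or block the assignments.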

3.6 Feasibility Results 

Since the predominant activity focused on an optimal design and 
implementation, only limited feasibility data or observations can be 
reported concerning the project. Perhaps the most important of these deals with 
manpower requirements and associated costs. The total effort for the 
project is approximately one man year. Approximately three-fourths of 
this effort was devoted to the study of prior research, the Air Force 
course analysis, and the design of the three studies. Approximately one 
quarter of the effort was devoted to computer implementation and data 
analysis. The important observation is that, given an operational general 
purpose computer with a modern time-sharing alphanumeric-oriented language, 
adaptive testing can be implemented in very brief periods of time. The 
PLATO language was especially well oriented for the preparation of item 
presentation. The regression equations presented slightly more problems, 
but not of a significant nature. Two major drawbacks to the PLATO system 
exist. One is unreliability (at times as much as three days' work effort 
was lost due to system failure). The second is the lack of a general file 
handling system for storage and retrieval.137 

Feasibility in terms of liaison and cooperation with ATC course 
personnel can be characterized as successful. In the early stages of 
cooperative information sharing, questions of the appropriateness and 
test security of this approach raised skepticism. When the instructors 
were able to see the test items presented on the computer terminal, how- 
ever, they could more accurately determine the equivalency of the testing 
method and the capability for time savings for their ongoing instruction. 
At this stage, the ATC instructional personnel can be characterized as 
highly cooperative and interested. 

137 University of Illinois personnel are implementing such a file 
handling system at this time, according to reports given the project 
team. 

IV. CONCLUSIONS AND RECOMMENDATIONS 



4.0 Overview 

As presented in the prior sections, this project has successfully 
demonstrated that theoretical models dealing with adaptive testing can be 
incorporated within the operational measurement requirements of ATC in 
order to both support ongoing testing, and assess the validity and effec- 
tiveness of a computer-based approach for technical training. As a general 
overview, the design and developmental work to implement adaptive testing 
proceeded in a most efficient and expeditious fashion. The ready 
acceptance by ATC instructional personnel and its implications for ongoing 
operational application speak to this obvious feasibility. Therefore, 
the conclusions shall be framed within the realization that the adaptive 
testing models (a flexilevel model for the Inventory Management course 
and a hierarchical testing model for the Precision Measuring Equipment 
course) are currently available and can be implemented when existing 
terminals and implementation support become available. This report now 
turns to specific conclusions which are framed within the three adaptive 
testing processes (entry, item tailoring, and scoring/interpretation/ 
reporting). 



4.1 Entry Processes 

Use of student characteristics (e.g., AFQT) and course performance 
variables allowed for an individualized variable entry process via linear 
regression prediction techniques. While the stepwise multiple regression 
coefficients were moderate in magnitude, the individualized entry should 
allow for a significant testing time reduction. Moreover, entering each 
student at his predicted difficulty level should improve the psychometric 
characteristics of the process due to the standardization effect and known 
discrimination effects of presenting test items at the .5 level. 



4.2 Tailoring Testing of Item Presentation 

The operational feasibility of tailoring items to a student's 
within-test performance was documented by this study. This flexible procedure should 
allow for time savings of up to 50 percent. The functional interrelation of 
testing and training within the hierarchical model should yield even more 
total time savings. The plans for sequentially expanding the criterion test 
zone should yield even more reliable testing decisions. Moreover, the tech- 
niques for analyzing error responses were also implemented in the computer 
routines. This facet of the study most evidently demonstrates the feasi- 
bility and potential of computer-based adaptive testing. 



4.3 Scoring, Interpretation, and Reporting 

The scoring routines allow for a conventional summed total correct 
score and an average item difficulty value which can be converted into a 
percentile score. Unfortunately, only limited conventional student 
and instructor reports were demonstrated. Future developments will undoubtedly 
indicate the need for verbally-oriented reports. 



4.4 Computer Implementation 

The University of Illinois PLATO system proved to be more than 
satisfactory for the implementation of the adaptive testing models. As 
more individualized test composition and reporting procedures are pursued, 
an improved file handling and report generation capability will have to be 
available on the computer system. Moreover, improved editing procedures are 
required if day-to-day revisions are to become operational. Finally, the 
design, coding, debugging, and documentation of the computer-based adaptive 
testing module with only five man months of effort illustrates the 
cost-effectiveness of this approach. 



4.5 Recommendations 

Given the success of this demonstration and feasibility study, the 
following recommendations appear evident: 

1. The empirical validation of the adaptive testing models is paramount. 
This validation process should take three forms. First, the concurrent and 
predictive validity of the adaptive testing scores should be related to 
conventional test scores. Second, the time savings stratified by test type 
(knowledge, hierarchical, problem solving, etc.) should be analyzed according 
to cost-effectiveness techniques. Third, the utility of the reports to 
students and instructors should be assessed to structure any required future 
extensions and maximize the impact of the measurement process. 

2. Future research should focus on the comparative benefits of the 
flexilevel routines as opposed to Bayesian or Robbins-Monro procedures. 
Those models yielding the greatest joint time savings and amplification of 
the criterion decision zone should be empirically verified. 

3. As the adaptive testing model establishes its empirical validity, a 
design study of its more dynamic integration into an adaptive instructional 
model (that is, the role and implication of adaptive testing as both a 
training strategy and as evaluative feedback to the training system) and its 
requirements for efficient computer implementation (that is, improved data 
file handling, editing for revision, and flexible verbal reports) should be 
pursued. 

APPENDIX A 
COMPUTER PROGRAM OUTPUT 



IM Block III Form B Norm-Referenced Analysis Output 

IM Block III Form B Criterion-Referenced Analysis Output 

IM Block IV Form A Norm-Referenced Analysis Output 

IM Block IV Form A Criterion-Referenced Analysis Output 